Linear-Size CDAWG: New Repetition-Aware Indexing and Grammar Compression

Authors

  • Takuya Takagi
  • Keisuke Goto
  • Yuta Fujishige
  • Shunsuke Inenaga
  • Hiroki Arimura

Abstract

In this paper, we propose a novel approach that combines compact directed acyclic word graphs (CDAWGs) with grammar-based compression. This leads to an efficient self-index, called Linear-size CDAWGs (L-CDAWGs), which can be represented with $O(\tilde{e}_T \log n)$ bits of space while allowing $O(\log n)$-time random and $O(1)$-time sequential access to edge labels, and $O(m \log \sigma + occ)$-time pattern matching. Here, $\tilde{e}_T$ is the number of all extensions of maximal repeats in $T$, $n$ and $m$ are respectively the lengths of the text $T$ and a given pattern, $\sigma$ is the alphabet size, and $occ$ is the number of occurrences of the pattern in $T$. The repetitiveness measure $\tilde{e}_T$ is known to be much smaller than the text length $n$ for highly repetitive text. For constant alphabets, our L-CDAWGs achieve $O(m + occ)$ pattern matching time with $O(e^r_T \log n)$ bits of space, which improves the pattern matching time of Belazzougui et al.’s run-length BWT-CDAWGs by a factor of $\log \log n$, with the same space complexity. Here, $e^r_T$ is the number of right extensions of maximal repeats in $T$. As a byproduct, our result gives a way of constructing a straight-line program (SLP) of size $O(\tilde{e}_T)$ for a given text $T$ in $O(n + \tilde{e}_T \log \sigma)$ time.
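
To make the repetitiveness measure concrete, the following is a minimal brute-force sketch that lists the maximal repeats of a short text and counts their right and left extensions. It is not the construction from the paper: it runs in polynomial time on toy inputs only, the function names are hypothetical, and it assumes the common definition in which a maximal repeat occurs at least twice and loses occurrences when extended by any character on either side. The empty string is excluded here, and how the paper combines left and right extensions into its measure is not reproduced.

```python
def occurrences(text, w):
    """Start positions of all (possibly overlapping) occurrences of w in text."""
    return [i for i in range(len(text) - len(w) + 1) if text.startswith(w, i)]


def maximal_repeats(text):
    """Non-empty substrings that occur at least twice and cannot be extended
    to the left or to the right without losing an occurrence (brute force)."""
    n = len(text)
    substrings = {text[i:j] for i in range(n) for j in range(i + 1, n + 1)}
    result = []
    for w in substrings:
        occ = occurrences(text, w)
        if len(occ) < 2:
            continue
        # A text boundary counts as its own context (None).
        left = {text[i - 1] if i > 0 else None for i in occ}
        right = {text[i + len(w)] if i + len(w) < n else None for i in occ}
        if len(left) >= 2 and len(right) >= 2:
            result.append(w)
    return sorted(result)


def extension_counts(text):
    """Return (right, left): the total number of distinct characters that
    follow, respectively precede, each maximal repeat somewhere in text."""
    n = len(text)
    right = left = 0
    for w in maximal_repeats(text):
        occ = occurrences(text, w)
        right += len({text[i + len(w)] for i in occ if i + len(w) < n})
        left += len({text[i - 1] for i in occ if i > 0})
    return right, left
```

For example, in `"xabyabz"` the only non-empty maximal repeat is `"ab"`, which is followed by `y` and `z` and preceded by `x` and `y`.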

Related resources

Composite Repetition-Aware Data Structures

In highly repetitive strings, like collections of genomes from the same species, distinct measures of repetition all grow sublinearly in the length of the text, and indexes targeted to such strings typically depend only on one of these measures. We describe two data structures whose size depends on multiple measures of repetition at once, and that provide competitive tradeoffs between the time ...

Indexing Straight-Line Programs

Straight-line programs offer powerful text compression by representing a text T[1, u] in terms of a context-free grammar of n rules, so that T can be recovered in O(u) time. However, the problem of operating the grammar in compressed form has not been studied much. We present the first grammar representation capable of extracting text substrings, and of searching the text for patterns, in time o(...
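
As an illustration of what a straight-line program is, here is a small sketch in which each rule is either a single character or a pair of earlier rules. The list-based representation and the function names are assumptions made for this example, not the representation used in the paper above. Expanding the root recovers the text in time linear in its length, and storing the expansion length of each rule allows a single character to be read by descending from the root, in time proportional to the grammar height.

```python
def expansion_lengths(rules):
    """rules[i] is a one-character string (terminal rule) or a pair (j, k)
    of indices j, k < i. Returns the length of each rule's expansion."""
    lengths = []
    for rule in rules:
        if isinstance(rule, str):
            lengths.append(1)
        else:
            left, right = rule
            lengths.append(lengths[left] + lengths[right])
    return lengths


def expand(rules, root):
    """Recover the full text derived from rule `root` with an explicit stack."""
    out = []
    stack = [root]
    while stack:
        rule = rules[stack.pop()]
        if isinstance(rule, str):
            out.append(rule)
        else:
            left, right = rule
            stack.append(right)  # pushed first so the left part is expanded first
            stack.append(left)
    return "".join(out)


def access(rules, lengths, root, i):
    """Return character i (0-based) of the expansion of `root` without
    expanding the whole text."""
    v = root
    while not isinstance(rules[v], str):
        left, right = rules[v]
        if i < lengths[left]:
            v = left
        else:
            i -= lengths[left]
            v = right
    return rules[v]


# Toy grammar: X0 = a, X1 = b, X2 = X0 X1, X3 = X2 X2, X4 = X3 X2
rules = ["a", "b", (0, 1), (2, 2), (3, 2)]
lengths = expansion_lengths(rules)
text = expand(rules, 4)
```

In this toy grammar, five rules derive the six-character text `ababab`; on highly repetitive inputs the gap between grammar size and text length can be much larger.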

Compact Directed Acyclic Word Graphs for a Sliding Window

The suffix tree is a well-known and widely-studied data structure that is highly useful for string matching. The suffix tree of a string w can be constructed in O(n) time and space, where n denotes the length of w. Larsson achieved an efficient algorithm to maintain a suffix tree for a sliding window. It contributes to prediction by partial matching (PPM) style statistical data compression sche...

On-Line Construction of Compact Directed Acyclic Word Graphs

A Compact Directed Acyclic Word Graph (CDAWG) is a space-efficient text indexing structure that can be used in several different string algorithms, especially in the analysis of biological sequences. In this paper, we present a new on-line algorithm for its construction, as well as the construction of a CDAWG for a set of strings.

Approximation of smallest linear tree grammar

A simple linear-time algorithm for constructing a linear context-free tree grammar of size O(rg + rg log(n/rg)) for a given input tree T of size n is presented, where g is the size of a minimal linear context-free tree grammar for T, and r is the maximal rank of symbols in T (which is a constant in many applications). This is the first example of a grammar-based tree compression algorithm with...

Publication date: 2017